Abstract

We generalize recent theoretical work on the minimal number of layers of narrow
deep belief networks that can approximate any probability distribution on the states of
their visible units arbitrarily well, from the setting of binary units (Sutskever and Hinton,
2008; Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011) to the setting of units
with finite state spaces. In particular, we show that a $q$-ary deep belief network with
$\lfloor (q^n - 1)/(q(q-1)(n - \log_q(n) - 1)) \rfloor$ layers of width $n$ is a universal approximator
of distributions on $\{0, 1, \ldots, q-1\}^n$. More generally, we bound the Kullback-Leibler
model approximation errors from above, depending on the network's depth and the state
spaces of the units in every layer. We provide complementary results for restricted
Boltzmann machines with finite state spaces.

Keywords: Deep belief network, restricted Boltzmann machine, universal approximation,
representational power, Kullback-Leibler divergence, q-ary variable



1 Introduction 



A deep belief network (Hinton et al., 2006) is a layered stochastic network with
undirected bipartite interactions between the units in the top two layers, and directed bipartite
interactions between the units in all other subsequent pairs of layers, directed towards
the bottom layer. The top two layers form a restricted Boltzmann machine (Smolensky,
1986; Freund and Haussler, 1991). The entire network defines a model of probability
distributions on the states of the units in the bottom layer, the visible layer. When the
number of units in every layer has the same order of magnitude, the network is called
narrow. The depth refers to the number of layers. Deep belief networks have found a
great number of applications in machine learning.

The representational power of neural networks in general has been studied for several
decades. For instance, a well known result (Hornik et al., 1989) shows that multilayer
feedforward networks with one exponentially large layer of hidden units are
universal approximators. The first proof of universal approximation by deep and narrow
sigmoid belief networks is due to Sutskever and Hinton (2008). They showed that
a narrow sigmoid belief network with $3(2^n - 1) + 1$ layers is a universal approximator
of probability distributions on $\{0, 1\}^n$. This depth bound has been improved
by Le Roux and Bengio (2010) and subsequently by Montúfar and Ay (2011), as we
will discuss below in more detail. These papers have focused on the universal approximation
depth of networks with binary units (binary DBNs). The purpose of the present
note is to generalize that work to narrow deep belief networks with finite valued units
(discrete DBNs). In addition, we give bounds for the maximal Kullback-Leibler model
approximation errors. We provide analogous results for restricted Boltzmann machines
with discrete units (discrete RBMs).



When a DBN has $L$ layers of widths $n_1, \ldots, n_L$ and the units in the $l$-th layer take
states $x^l = (x^l_1, \ldots, x^l_{n_l}) \in \mathcal{X}^l = \mathcal{X}^l_1 \times \cdots \times \mathcal{X}^l_{n_l}$,
then the joint distributions on the states of all units have the form

$$p(x^1, \ldots, x^L) = q(x^{L-1}, x^L) \prod_{l=1}^{L-2} p_l(x^l \mid x^{l+1}), \qquad (1)$$

where

$$q(x, y) = \exp(\mathbf{x}^\top \Theta \mathbf{y}) / Z(\Theta), \qquad (2)$$

$$p_l(x \mid y) = \prod_{i \in [n_l]} p_{l,i}(x_i \mid y), \qquad (3)$$

$$p_{l,i}(x_i \mid y) = \exp(\mathbf{x}_i^\top \Theta^l_i \mathbf{y}) / Z(\Theta^l_i \mathbf{y}). \qquad (4)$$

Here $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ denotes the entry-wise one-hot representation of $x = (x_1, \ldots, x_n)$.
The matrix $\Theta = \Theta^{L-1}$ contains the bias and interaction weights of the top two layers, the
matrix $\Theta^l_i$ contains the bias and input weights of the $i$-th unit in layer $l$, and $Z$ is in
each case the partition function. The DBN probability model consists of the marginal
distributions $p(x^1) = \sum_{x^2, \ldots, x^L} p(x^1, \ldots, x^L)$ on the states of the units in the first layer.



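A model $\mathcal{M}$ of probability distributions on a finite set $\mathcal{X}$ is a universal approximator
iff the maximum $D_{\mathcal{M}}$ of the Kullback-Leibler divergence from any distribution on $\mathcal{X}$
to the closest distribution in $\mathcal{M}$ vanishes. More formally, $D_{\mathcal{M}} := \sup_{p \in \Delta} D(p \| \mathcal{M})$,
where $\Delta$ is the set of all distributions on $\mathcal{X}$, and $D(p \| \mathcal{M}) := \inf_{q \in \mathcal{M}} D(p \| q)$, where
$D(p \| q) := \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$ when $\operatorname{supp}(p) \subseteq \operatorname{supp}(q)$ and $D(p \| q) := \infty$ otherwise.
We refer to $D_{\mathcal{M}}$ as the universal or maximal approximation error of $\mathcal{M}$.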
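The definitions above can be illustrated with a minimal Python sketch. The function names `kl_divergence` and `divergence_to_model` are hypothetical, and the model is assumed to be given as a finite list of candidate distributions, so the infimum becomes a minimum; this is only an illustration of the definitions, not a way to compute $D_{\mathcal{M}}$ for an actual DBN.

```python
import math

def kl_divergence(p, q):
    """D(p||q) for two distributions given as equal-length lists of probabilities."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                # supp(p) is not contained in supp(q): the divergence is infinite.
                return math.inf
            total += px * math.log(px / qx)
    return total

def divergence_to_model(p, model):
    """D(p||M), assuming the model M is a finite list of distributions."""
    return min(kl_divergence(p, q) for q in model)
```

For instance, `kl_divergence([1.0, 0.0], [0.5, 0.5])` evaluates to $\log(2)$, while swapping the two arguments gives infinity because the support condition fails.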



For binary DBNs the following is known: 

Theorem 1. Let DBN be a binary deep belief network probability model with $L$ layers
of width $n = 2^{k-1} + k$. If $L \geq 1 + 2^S$ for some $S \in \{0, 1, \ldots, 2^{k-1}\}$, then the maximum
of the Kullback-Leibler divergence from any target distribution on the states $\{0, 1\}^n$ of
the visible layer to its best approximation within DBN is bounded by

$$D_{\text{DBN}} \leq (2^{k-1} - S) \log(2). \qquad (5)$$

In particular, this probability model is a universal approximator whenever

$$L \geq 1 + 2^{2^{k-1}}. \qquad (6)$$

Note that

$$\frac{2^n}{2(n - \log_2(n))} \leq 2^{2^{k-1}} \leq \frac{2^n}{2(n - \log_2(n) - 1)}. \qquad (7)$$

The universal approximation statement, eq. (6), is due to Montúfar and Ay (2011,
Theorem 2). It is based on a refinement of previous work by Le Roux and Bengio (2010),
who obtained the bound $L \geq 1 + 2^n/n$ when $n$ is a power of two. The approximation error
bound, eq. (5), was discussed recently in (Montúfar and Morton, 2012, Theorem 18).

Figure 1: DBN with 5 layers of width $n$ and $q_1, \ldots, q_n$ valued units in each layer.



It is based on the observation that DBNs lacking the universal approximation depth 
still may be universal approximators for subsets of visible units, which can be used to 
describe explicit sub-model classes of DBNs and estimate the approximation errors. 

In the following we say that a unit is $q$ valued if its state space has cardinality $q$,
and assume without loss of generality that $q \geq 2$. We make the simplifying assumption
that all layers have the same width, and that the units in each layer have the same state
spaces. See Figure 1. The results automatically hold for networks with wider hidden
layers or hidden units with larger state spaces. Our generalization of Theorem 1 is
the following.

Theorem 2. Let DBN be a discrete deep belief network probability model with $L$ layers
of width $n$ and $q_1, \ldots, q_n$ valued units in each layer. Let $m$ be any integer with
$n \geq m \geq \prod_{j=m+2}^{n} q_j$ and let $q = q_1 \geq \cdots \geq q_m$, $q \geq 2$. If
$L \geq 2 + \frac{q^S - 1}{q - 1}$ for some $S \in \{0, 1, \ldots, m\}$, then the maximum of the
Kullback-Leibler divergence from any target distribution on the states of the visible layer,
$\{0, 1, \ldots, q_1 - 1\} \times \cdots \times \{0, 1, \ldots, q_n - 1\}$,
to its best approximation within DBN is bounded by

$$D_{\text{DBN}} \leq \log\Big( \prod_{j \in [m-S]} q_j \Big).$$

In particular, this probability model is a universal approximator whenever

$$L \geq 2 + \frac{q^m - 1}{q - 1}.$$

When all units are $q$-ary and the layer width is $n = q^{k-1} + k$, then the DBN model
is a universal approximator whenever $L \geq 2 + \frac{q^{q^{k-1}} - 1}{q - 1}$. Note that

$$\frac{q^n - 1}{q(q-1)(n - \log_q(n))} \leq \frac{q^{q^{k-1}} - 1}{q - 1} \leq \frac{q^n - 1}{q(q-1)(n - \log_q(n) - 1)}. \qquad (8)$$
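To make the depth and error bounds of Theorem 2 concrete, the following minimal Python sketch evaluates them for given cardinalities. The function names are hypothetical, natural logarithms are assumed, and the cardinalities are assumed to be sorted as in the theorem ($q = q_1 \geq \cdots \geq q_m$).

```python
import math

def geometric_sum(q, S):
    """1 + q + ... + q^(S-1), which equals (q^S - 1)/(q - 1) for q >= 2."""
    return sum(q**t for t in range(S))

def sufficient_depth(q, S):
    """Depth L = 2 + (q^S - 1)/(q - 1) appearing in Theorem 2."""
    return 2 + geometric_sum(q, S)

def error_bound(cardinalities, m, S):
    """log of the product of q_j over j in [m - S] (empty product gives 0)."""
    return sum(math.log(qj) for qj in cardinalities[: m - S])

# Ternary units with k = 2: width n = 3^(k-1) + k = 5 and m = 3.
print(sufficient_depth(3, 3))        # depth sufficient for universal approximation
print(error_bound([3] * 5, 3, 1))    # error bound for a shallower network (S = 1)
```

In the binary case the universal approximation depth $2 + (2^m - 1)/(2 - 1)$ equals $1 + 2^m$, which agrees with eq. (6) when $m = 2^{k-1}$.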

Remarks 

Note that, in contrast to Theorem 1, Theorem 2 does not impose conditions on the layer
width $n$.

The set of all distributions on $\{0, 1, \ldots, q-1\}^n$ has dimension $q^n - 1$. The total number
of parameters of a $q$-ary DBN with $L$ layers of width $n$ is
$(L-1)(n(q-1)+1)\,n(q-1) + n(q-1)$. Hence the model is full dimensional only if
$L \geq \frac{q^n - 1}{n(q-1)(n(q-1)+2)} + 1$.
This is a parameter-counting lower bound for the universal approximation depth. The
sufficiency upper bound from Theorem 2 surpasses this by roughly a factor $n$. The
following observations show that the upper bound could nevertheless be tight (up to small
factors of $n$).
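The parameter-counting argument can be checked with a short Python sketch. It uses the parameter count stated above and searches directly for the smallest depth whose count reaches $q^n - 1$; the function names are hypothetical and the search starts at $L = 2$, which is an assumption of this sketch (a DBN needs at least two layers).

```python
def dbn_parameter_count(q, n, L):
    """(L-1)(n(q-1)+1) n(q-1) + n(q-1) parameters of a q-ary DBN with L layers of width n."""
    a = n * (q - 1)
    return (L - 1) * (a + 1) * a + a

def min_depth_full_dimension(q, n):
    """Smallest L >= 2 for which the parameter count is at least q^n - 1."""
    L = 2
    while dbn_parameter_count(q, n, L) < q**n - 1:
        L += 1
    return L

# Binary units, width 10: the distribution simplex has dimension 2^10 - 1 = 1023.
print(min_depth_full_dimension(2, 10))
```

This is only a necessary condition for universal approximation; as the text notes, models with hidden variables can have dimension strictly smaller than their parameter count.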

Probability models with hidden variables can have dimension strictly smaller than
their parameter count (dimension defect). Moreover, in some cases even full dimensional
models represent only very restricted classes of distributions, as has been observed,
for example, in tree models with hidden variables. It is known that the smallest
naive Bayes model universal approximator of distributions on $\{0, 1, \ldots, q-1\}^n$ has
$q^{n-1}(n(q-1)+1) - 1$ parameters when $q$ is a prime power (see Montúfar, 2013,
Theorem 13). In this case the parameter-counting lower bound is $q^n/(n(q-1)+1)$.

At the same time we believe that the approximation-error bounds are too pessimistic.
However, computing tight bounds for the maximum of the Kullback-Leibler divergence
remains very challenging (this is so even for simple probability models without hidden
variables).



Outline of the Proof 

We will prove Theorem 2 by first studying the individual parts of the network: the
RBMs formed by the top two DBN layers (Section 2); the individual units with directed
inputs (Section 3); the probability sharing that can be realized by stacks of layers
(Section 4); and finally, the sets of distributions that DBNs can represent in their bottom
layer (Section 5). The proof steps of Theorem 2 can be summarized as follows:

• Show that the top RBM can approximate any probability distribution with support on
  a set of the form $\mathcal{X}_1 \times \cdots \times \mathcal{X}_k \times \{0\} \times \cdots \times \{0\}$ arbitrarily well,
  where the zeros occupy the coordinates $k+1, \ldots, n$.

• For an $\mathcal{X}_1$-valued unit receiving $n$ directed inputs, show that there is a choice of
  parameters for which the following holds for each possible state $h_n \in \mathcal{X}_n$ of the $n$-th input
  unit: If the input vector is $(h_1, h_2, \ldots, h_n)$, then the unit outputs $h_1'$ with probability
  $p^{h_n}(h_1')$, where $p^{h_n}$ is an arbitrary distribution on $\mathcal{X}_1$ for all $h_n \in \mathcal{X}_n$.

• Show that there is a sequence of $\frac{q^m - 1}{q - 1}$ stochastic maps $p(h) \mapsto p(v) = \sum_h p(v \mid h) p(h)$
  (each of them superposing nearly $qn$ probability multi-sharing steps), which maps the
  probability distributions represented by the top RBM to any probability distribution
  on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$.

• Show that the DBN can represent tractable classes of probability distributions, and
  estimate their maximal approximation errors.

The superposition of probability sharing steps is inspired by (Le Roux and Bengio,
2010), together with the refinements of that work devised in (Montúfar and Ay, 2011).
By probability sharing we refer to the process of transferring an arbitrary amount of
probability from a state vector $x'$ to another state vector $x''$. In contrast to the binary
proofs, where each layer superposes about $2n$ sharing steps, here each layer superposes
about $qn$ multi-sharing steps, whereby each multi-sharing step transfers probability
from one state to $q - 1$ states (when the units are $q$-ary).



2 Restricted Boltzmann Machines 

We denote by $\text{RBM}_{\mathcal{X},\mathcal{Y}}$ the restricted Boltzmann machine probability model with hidden
units $Y_1, \ldots, Y_m$ taking states in $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_m$ and visible units $X_1, \ldots, X_n$
taking states in $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. By default RBMs are defined with binary units;
however, RBMs with discrete units have appeared in (Welling et al., 2005), and their
representational power has been studied in (Montúfar and Morton, 2013). The following
result is closely related to (Montúfar and Morton, 2013, Theorem 16):

Theorem 3. The model $\text{RBM}_{\mathcal{X},\mathcal{Y}}$ can approximate any mixture $p = \sum_{i=0}^{m} \lambda_i p_i$ arbitrarily
well, where $p_0$ is any product distribution, and $p_i$ is any mixture of $(|\mathcal{Y}_i| - 1)$
product distributions, for all $i \in [m]$, whereby $\operatorname{supp}(p_i) \cap \operatorname{supp}(p_j) = \emptyset$ for all
$1 \leq i < j \leq m$.

Here, a product distribution $q$ is a distribution on $\mathcal{X}$ that factorizes as
$q(x_1, \ldots, x_n) = \prod_{j \in [n]} q_j(x_j)$, where $q_j$ is a distribution on $\mathcal{X}_j$ for all $j \in [n]$.
A mixture is a weighted sum with non-negative weights adding to one. The support of a
distribution $p$ is $\operatorname{supp}(p) := \{x \in \mathcal{X} : p(x) > 0\}$.

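Proof of Theorem 3. Let $\mathcal{E}_{\mathcal{X}}$ denote the set of strictly positive product distributions of
$X_1, \ldots, X_n$. Let $\mathcal{M}^k_{\mathcal{X}}$ denote the set of all mixtures of $k$ product distributions from
$\mathcal{E}_{\mathcal{X}}$. The closure $\overline{\mathcal{M}^k_{\mathcal{X}}}$ contains all mixtures of $k$ product distributions, including those
which are not strictly positive. Let $q \circ q'$ denote the renormalized entry-wise product
with $(q \circ q')(x) = q(x) q'(x) / \sum_{x' \in \mathcal{X}} q(x') q'(x')$. The model $\text{RBM}_{\mathcal{X},\mathcal{Y}}$ can be written,
up to normalization, as the set

$$\mathcal{M}^{|\mathcal{Y}_1|}_{\mathcal{X}} \circ \cdots \circ \mathcal{M}^{|\mathcal{Y}_m|}_{\mathcal{X}} = \mathbb{R}_+ \mathcal{E}_{\mathcal{X}} \circ (1 + \mathbb{R}_+ \mathcal{M}^{|\mathcal{Y}_1|-1}_{\mathcal{X}}) \circ \cdots \circ (1 + \mathbb{R}_+ \mathcal{M}^{|\mathcal{Y}_m|-1}_{\mathcal{X}}). \qquad (9)$$

Now consider any probability distributions $p_0 \in \mathcal{E}_{\mathcal{X}}$, $p'_1 \in \overline{\mathcal{M}^{|\mathcal{Y}_1|-1}_{\mathcal{X}}}, \ldots, p'_m \in \overline{\mathcal{M}^{|\mathcal{Y}_m|-1}_{\mathcal{X}}}$.
If $\operatorname{supp}(p'_i) \cap \operatorname{supp}(p'_j) = \emptyset$ for all $1 \leq i < j \leq m$, then the product
$(1 + \lambda'_1 p'_1) \circ \cdots \circ (1 + \lambda'_m p'_m)$ is equal to $1 + \sum_{i \in [m]} \lambda'_i p'_i$, up to normalization.
Let $\lambda'_i = \frac{\lambda_i}{\lambda_0} \sum_x p_i(x)/p_0(x)$ and $p'_i(x) \propto p_i(x)/p_0(x)$. Then
$\lambda_0 p_0 \circ (1 + \sum_{i \in [m]} \lambda'_i p'_i) = \sum_{i=0}^{m} \lambda_i p_i$. Hence the
mixture distribution $p$ is contained in the closure of the RBM model. □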

We now discuss certain classes of probability distributions that can be approximated
arbitrarily well by RBMs, depending on their hidden units.

Let $\varrho = \{A_1, \ldots, A_k\}$ be a partition of $\mathcal{X}$. The partition model $\mathcal{P}$ with partition $\varrho$
is the set of all probability distributions on $\mathcal{X}$ with constant value on each $A_i$. Geometrically,
this is the simplex whose vertices are the uniform distributions $\mathbb{1}_{A_i}/|A_i|$ for all
$i \in [k]$. The coarseness of $\mathcal{P}$ is $\max_i |A_i|$. The Kullback-Leibler divergence from partition
models is well understood; in particular, by (Matúš and Ay, 2003, Corollary 1):

Lemma 4. If $\mathcal{P}$ is a partition model of coarseness $c$, then $D_{\mathcal{P}} = \log(c)$.
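The value $\log(c)$ can be illustrated numerically with a minimal Python sketch. It assumes the standard fact that the distribution in a partition model closest to $p$ (in the sense of $D(p \| \cdot)$) spreads the mass $p(A_i)$ uniformly over each block $A_i$; the function names and the example partition are hypothetical.

```python
import math

def project_to_partition_model(p, blocks):
    """Spread the p-mass of every block uniformly over that block."""
    q = [0.0] * len(p)
    for block in blocks:
        mass = sum(p[x] for x in block)
        for x in block:
            q[x] = mass / len(block)
    return q

def kl_divergence(p, q):
    """D(p||q); assumes supp(p) is contained in supp(q)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Partition of {0,...,4} with blocks of sizes 3 and 2, so the coarseness is c = 3.
blocks = [[0, 1, 2], [3, 4]]
point_mass = [1.0, 0.0, 0.0, 0.0, 0.0]   # concentrated inside the largest block
divergence = kl_divergence(point_mass, project_to_partition_model(point_mass, blocks))
print(divergence, math.log(3))
```

A point mass inside a largest block attains the divergence $\log(c)$ in this example, consistent with Lemma 4.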

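Now, RBMs can approximate certain partition models arbitrarily well:

Lemma 5. Let $\mathcal{P}$ be the partition model with partition blocks
$\{x_1\} \times \cdots \times \{x_k\} \times \mathcal{X}_{k+1} \times \cdots \times \mathcal{X}_n$ for all $x_i \in \mathcal{X}_i$, for all $i \in [k]$.
When $1 + \sum_{j \in [m]} (|\mathcal{Y}_j| - 1) \geq (\prod_{i \in [k]} |\mathcal{X}_i|) / \max_{j \in [k]} |\mathcal{X}_j|$,
then $\mathcal{P}$ can be approximated arbitrarily well by $\text{RBM}_{\mathcal{X},\mathcal{Y}}$.

Proof. Any point in $\mathcal{P}$ is a mixture of the uniform distributions on the partition blocks.
The partition blocks, and hence the supports of these mixture components, are disjoint.
They are product distributions that can be written as
$p_{x_1, \ldots, x_k} = \prod_{i \in [k]} \delta_{x_i} \prod_{i \in [n] \setminus [k]} u_i$,
where $u_i$ denotes the uniform distribution on $\mathcal{X}_i$. For any $j \in [k]$, any mixture of the
form $\sum_{x_j \in \mathcal{X}_j} \lambda_{x_j} p_{x_1, \ldots, x_k}$ is also a product distribution, which factorizes as

$$\Big( \sum_{x_j \in \mathcal{X}_j} \lambda_{x_j} \delta_{x_j} \Big) \prod_{i \in [k] \setminus \{j\}} \delta_{x_i} \prod_{i \in [n] \setminus [k]} u_i. \qquad (10)$$

Hence any point in $\mathcal{P}$ is a mixture of $(\prod_{i \in [k]} |\mathcal{X}_i|) / \max_{j \in [k]} |\mathcal{X}_j|$ product distributions
of the form given in eq. (10). The claim follows from Theorem 3. □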

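Lemma 5 together with Lemma 4 implies:

Theorem 6. If $1 + \sum_{j \in [m]} (|\mathcal{Y}_j| - 1) \geq (\prod_{i \in \Lambda} |\mathcal{X}_i|) / \max_{j \in \Lambda} |\mathcal{X}_j|$ for some $\Lambda \subseteq [n]$,
then

$$D_{\text{RBM}_{\mathcal{X},\mathcal{Y}}} \leq \log\Big( \prod_{i \in [n] \setminus \Lambda} |\mathcal{X}_i| \Big).$$

In particular, the model $\text{RBM}_{\mathcal{X},\mathcal{Y}}$ is a universal approximator whenever

$$1 + \sum_{j \in [m]} (|\mathcal{Y}_j| - 1) \geq |\mathcal{X}| / \max_{i \in [n]} |\mathcal{X}_i|.$$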







Figure 2: Star graphical model of conditional distributions. 



When all units are $q$-ary, the RBM with $(q^{n-1} - 1)/(q - 1)$ hidden units is a universal
approximator of distributions on $\{0, 1, \ldots, q-1\}^n$. Theorem 6 generalizes previous
results on binary RBMs (Montúfar and Ay, 2011, Theorem 1) and (Montúfar et al., 2011,
Theorem 5.1), where it is shown that a binary RBM with $2^{n-1} - 1$ hidden units is a
universal approximator of distributions on $\{0, 1\}^n$, and that the maximal approximation
error decreases at least logarithmically in the number of hidden units. A previous result
by Le Roux and Bengio (2008, Theorem 2) shows that a binary RBM with $2^n + 1$ hidden
units is a universal approximator of distributions on $\{0, 1\}^n$.
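The sufficient condition for universal approximation in Theorem 6 is easy to evaluate. The following minimal Python sketch (with a hypothetical function name) checks the inequality $1 + \sum_j (|\mathcal{Y}_j| - 1) \geq |\mathcal{X}| / \max_i |\mathcal{X}_i|$ for given hidden and visible cardinalities.

```python
import math

def rbm_universal_condition(hidden_sizes, visible_sizes):
    """True if 1 + sum_j(|Y_j| - 1) >= |X| / max_i |X_i|."""
    capacity = 1 + sum(r - 1 for r in hidden_sizes)
    # The largest cardinality divides the product, so integer division is exact.
    required = math.prod(visible_sizes) // max(visible_sizes)
    return capacity >= required

# Binary case with n = 4 visible units: 2^(n-1) - 1 = 7 hidden units satisfy the condition.
print(rbm_universal_condition([2] * 7, [2] * 4))
# Ternary case with n = 3 visible units: (3^2 - 1)/(3 - 1) = 4 hidden units satisfy it.
print(rbm_universal_condition([3] * 4, [3] * 3))
```

The condition is only sufficient, so a `False` result does not show that the RBM fails to be a universal approximator.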



3 The Internal Node of a Star 

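Consider an inwards directed star graph, as shown in Figure 2, with leaf variables taking
states in $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_m$, and an internal node variable taking states in $\mathcal{V}$. Denote
by $\mathcal{S}_{\mathcal{V},\mathcal{Y}}$ the set of conditionals on $\mathcal{V}$ defined by this network, as in eq. (4):

$$p(v \mid y; \Theta) = \exp(\mathbf{v}^\top \Theta \mathbf{y}) / Z(\Theta \mathbf{y}), \quad \text{for all } y \in \mathcal{Y}. \qquad (11)$$

The model $\mathcal{S}_{\mathcal{V},\mathcal{Y}}$ can represent certain stochastic maps that we will use to define a
probability sharing scheme in Section 4.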
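As an illustration of eq. (11), the following Python sketch evaluates the conditional distribution of the internal node for a given input state. The encoding is an assumption of this sketch: the input is represented by a constant entry (for the bias) followed by the one-hot vectors of its entries, and the output state indexes the rows of the parameter matrix. The function names are hypothetical.

```python
import numpy as np

def one_hot_with_bias(y, sizes):
    """Constant 1 (bias entry, an assumption of this sketch) followed by one-hot blocks."""
    parts = [np.ones(1)]
    for yj, rj in zip(y, sizes):
        block = np.zeros(rj)
        block[yj] = 1.0
        parts.append(block)
    return np.concatenate(parts)

def star_conditional(theta, y, sizes):
    """p(. | y; theta) proportional to exp(theta @ y_vec); theta has shape (|V|, 1 + sum(sizes))."""
    logits = theta @ one_hot_with_bias(y, sizes)
    logits = logits - logits.max()          # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

sizes = (3, 2)                               # two leaves with 3 and 2 states
theta = np.zeros((4, 1 + sum(sizes)))        # internal node with 4 states
print(star_conditional(theta, (2, 1), sizes))  # zero parameters give the uniform distribution
```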

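Lemma 7. Let $\mathcal{Z} = \{y_1\} \times \cdots \times \{y_{k-1}\} \times \mathcal{Y}_k \times \{y_{k+1}\} \times \cdots \times \{y_m\} \subseteq \mathcal{Y}$, $k \neq m$,
let $\mathcal{V} = \mathcal{Y}_m$, and let $\{q_z : z \in \mathcal{Z}\}$ be any distributions on $\mathcal{V}$. Then there is a choice of
the parameters $\Theta$ of $\mathcal{S}_{\mathcal{V},\mathcal{Y}}$ such that

$$p(\cdot \mid y; \Theta) = \begin{cases} q_y, & \text{if } y \in \mathcal{Z}, \\ \delta_{y_m}, & \text{otherwise.} \end{cases}$$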


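Proof. Let $\mathcal{Y}_j = \{0, 1, \ldots, r_j - 1\}$ for all $j \in [m]$, and $r = |\mathcal{V}| = r_m$. The set of
strictly positive probability distributions on $\mathcal{V}$ is an exponential family
$\mathcal{E}_{\mathcal{V}} = \{p(v; \theta) = \exp(\mathbf{v}^\top \theta)/Z(\theta) : \theta \in \mathbb{R}^d\}$. Let $\vartheta_v$ be the parameter vector of a
distribution which attains a unique maximum at $v$. Then for any fixed $\eta \in \mathbb{R}^d$ we have

$$\lim_{K \to \infty} p(x; \eta + K\vartheta_v) = \delta_v(x). \qquad (12)$$

To see this, note that $p(x; K\vartheta_v) \propto p(x; \vartheta_v)^K$ and hence $\lim_{K \to \infty} p(x; K\vartheta_v) = \delta_v$, and
furthermore, $p(x; \eta + K\vartheta_v) \propto p(x; \eta)\, p(x; K\vartheta_v)$.

Without loss of generality let $\mathcal{Z} = \mathcal{Y}_1 \times \{0\} \times \cdots \times \{0\}$. For each $z \in \mathcal{Z}$ let
$\theta^{z_1} \in \mathbb{R}^d$ be such that $q_z = p(\cdot; \theta^{z_1}) \in \mathcal{E}_{\mathcal{V}}$. The map $\Theta$ can be defined as follows:

$$\Theta = [\, \Theta_1 \mid \Theta_2 \mid \cdots \mid \Theta_m \,], \qquad (13)$$

where $\Theta_j$ contains the columns corresponding to $y_j$ in eq. (11), and

$$\begin{aligned}
\Theta_1 &= [\, \theta^0 \mid \theta^1 \mid \cdots \mid \theta^{r_1 - 1} \,] \in \mathbb{R}^{d \times r_1}; \\
\Theta_j &= [\, 0 \mid K_0\vartheta_0 \mid K_0\vartheta_0 \mid \cdots \mid K_0\vartheta_0 \,] \in \mathbb{R}^{d \times r_j}, \quad \text{for } j = 2, \ldots, m-1; \\
\Theta_m &= [\, 0 \mid K_1\vartheta_1 \mid K_2\vartheta_2 \mid \cdots \mid K_{r-1}\vartheta_{r-1} \,] \in \mathbb{R}^{d \times r}.
\end{aligned} \qquad (14)$$

The matrix $\Theta$ maps $\{\mathbf{y} : y \in \mathcal{Z}\}$ into the parameter vectors $\{\theta^{z_1} : z_1 \in \mathcal{Y}_1\}$ with
corresponding distributions $\{q_z : z \in \mathcal{Z}\}$. When $K_0, \ldots, K_{r-1} \in \mathbb{R}$ are chosen such that
$\|\theta^0\|, \ldots, \|\theta^{r_1 - 1}\| \ll K_0 \ll K_1, \ldots, K_{r-1}$, then, for each $y \in \mathcal{Y} \setminus \mathcal{Z}$, the vector $\mathbf{y}$ is
mapped into a parameter vector $\Theta\mathbf{y}$ with $p(\cdot \mid y; \Theta)$ arbitrarily close to $\delta_{y_m}$. □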
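The limit in eq. (12) can be observed numerically. The sketch below assumes a three-state variable with one natural parameter per state (so the exponential family is a softmax), and takes $\vartheta_v$ to be the indicator vector of $v$, which has a unique maximum at $v$; the function names and the particular values of $\eta$ are hypothetical.

```python
import numpy as np

def softmax(theta):
    """p(x; theta) proportional to exp(theta_x), one parameter per state."""
    w = np.exp(theta - theta.max())
    return w / w.sum()

def sharpened(eta, vartheta, K):
    """Distribution with parameter eta + K * vartheta, as in eq. (12)."""
    return softmax(eta + K * vartheta)

eta = np.array([0.3, -1.2, 0.8])        # arbitrary fixed parameter vector
vartheta = np.array([0.0, 1.0, 0.0])    # unique maximum at state v = 1
for K in (0, 1, 10, 100):
    print(K, sharpened(eta, vartheta, K))
```

As $K$ grows, the printed distributions concentrate on state $v = 1$ regardless of $\eta$, which is the mechanism used in the proof to override the input-dependent parameters outside $\mathcal{Z}$.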

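Remark 8. In order to prove Lemma 7 for any $\mathcal{Z}$ it is sufficient to show that (i) the
vectors $\{\mathbf{y} : y \in \mathcal{Z}\}$ are affinely independent, and (ii) there is a linear map mapping
$\{\mathbf{y} : y \in \mathcal{Z}\}$ into the zero vector, and $\mathbf{h}$ into the relative interior of the normal cone of
$Q_{\mathcal{V}} := \operatorname{conv}\{\mathbf{v} : v \in \mathcal{V}\}$ at the vertex $v = h_m$ for all $h \in \mathcal{Y} \setminus \mathcal{Z}$.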



4 Probability Sharing 

A single directed layer 

Consider an input layer $Y_1, \ldots, Y_m$ with directed bipartite connections towards an output
layer $X_1, \ldots, X_n$. Denote by $\mathcal{L}_{\mathcal{X},\mathcal{Y}}$ the model of conditional distributions defined
by this network, as in eq. (3). This model defines a family of linear stochastic maps
$\mathcal{L}_\Theta \colon q \mapsto \sum_{y \in \mathcal{Y}} p(x \mid y; \Theta)\, q(y)$ from the simplex $\Delta(\mathcal{Y})$ of distributions on $\mathcal{Y}$ to the
simplex $\Delta(\mathcal{X})$ of distributions on $\mathcal{X}$.

For any $y \in \mathcal{Y}$ and $j \in [m]$, we denote by $y[j]$ the one-dimensional cylinder set
$\{y_1\} \times \cdots \times \{y_{j-1}\} \times \mathcal{Y}_j \times \{y_{j+1}\} \times \cdots \times \{y_m\}$. Similarly, for any $\Lambda \subseteq [m]$, we
denote by $y[\Lambda]$ the set of all arrays with fixed values $\{y_i\}_{i \in [m] \setminus \Lambda}$ in the entries $[m] \setminus \Lambda$.
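The cylinder-set notation can be made concrete with a few lines of Python. The function name `cylinder_set` is hypothetical, and coordinates are 0-indexed in the code, whereas the text uses 1-indexed coordinates.

```python
from itertools import product

def cylinder_set(y, free, sizes):
    """All states that agree with y outside the coordinates in `free` (0-indexed)."""
    ranges = [range(sizes[i]) if i in free else (y[i],) for i in range(len(sizes))]
    return list(product(*ranges))

sizes = (2, 3, 4)                           # cardinalities of three units
print(cylinder_set((0, 1, 2), {1}, sizes))  # one-dimensional cylinder set in the second coordinate
print(len(cylinder_set((0, 1, 2), {0, 2}, sizes)))
```

A one-dimensional cylinder set in coordinate $j$ therefore contains exactly $|\mathcal{Y}_j|$ states, and a cylinder set with free coordinates $\Lambda$ contains $\prod_{i \in \Lambda} |\mathcal{Y}_i|$ states.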

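Applying Lemma 7 to each output unit of $\mathcal{L}_{\mathcal{X},\mathcal{Y}}$ shows:

Theorem 9. Consider some $\{y^{(s)}\}_{s \in [k]} \subseteq \mathcal{Y}$ and let $\{j_s\}_{s \in [k]}, \{i_s\}_{s \in [k]} \subseteq [m]$.
If the sets $y^{(s)}[j_s]$, $s \in [k]$, are disjoint, and $\mathcal{Z}$ is a subset of $\mathcal{Y}$ containing them, then the image of
$\Delta(\mathcal{Z})$ by $\mathcal{L}_{\mathcal{Y},\mathcal{Y}}$ contains $\Delta(\mathcal{Z} \cup \bigcup_{s \in [k]} y^{(s)}[\{j_s, i_s\}])$.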



A stack of directed layers 

In the case of binary units, sequences of probability sharing can be defined conveniently
using Gray codes, as done in (Le Roux and Bengio, 2010). A Gray code is an ordered
sequence of vectors, where each two subsequent vectors differ in only one entry. A
binary Gray code can be viewed as a sequence of one-dimensional cylinder sets. In the
non-binary case, this correspondence is no longer present. Motivated by Theorem 9,
we will use one-dimensional cylinder sets to define sequences of multi-sharing steps, as
shown in Figure 3.

Figure 3: Three multi-sharing steps on $\{0, 1, 2\} \times \{0, 1, 2, 3\}$.
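For readers unfamiliar with Gray codes, the following Python sketch generates the standard binary reflected Gray code and checks the defining property that consecutive vectors differ in exactly one entry. This is a generic construction given for illustration; it is not claimed to be the specific code used in the cited binary proofs.

```python
def binary_gray_code(n):
    """Binary reflected Gray code on n bits, as a list of 0/1 tuples."""
    code = []
    for i in range(2**n):
        g = i ^ (i >> 1)                                  # standard reflected Gray code
        code.append(tuple((g >> b) & 1 for b in reversed(range(n))))
    return code

def hamming_distance(u, v):
    """Number of entries in which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

code = binary_gray_code(3)
print(code)
print(all(hamming_distance(u, v) == 1 for u, v in zip(code, code[1:])))
```

Each pair of consecutive vectors spans a one-dimensional cylinder set of size two, which is the correspondence mentioned in the text for the binary case.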

Let $q_i = |\mathcal{Y}_i|$ denote the cardinality of $\mathcal{Y}_i$ for $i \in [n]$, and let $m \leq n$. The set
$\mathcal{Z} = \{0\} \times \cdots \times \{0\} \times \mathcal{Y}_{m+1} \times \cdots \times \mathcal{Y}_n \subseteq \mathcal{Y}$ can be written as the disjoint union
of $k = \prod_{j=m+2}^{n} q_j$ one-dimensional cylinder sets, as $\mathcal{Z} = \bigcup_{s=1}^{k} y^{(s)}[m+1]$, where
$y^{(s)} = (0, \ldots, 0, y^{(s)}_{m+2}, \ldots, y^{(s)}_n)$ and
$\{(y^{(s)}_{m+2}, \ldots, y^{(s)}_n)\}_{s=1}^{k} = \mathcal{Y}_{m+2} \times \cdots \times \mathcal{Y}_n$.

In the following, each set $y^{(s)}[m+1]$ will be the starting point of a sequence of
sharing steps. By Theorem 9, a directed DBN layer can map the simplex of distributions
$\Delta(y^{(1)}[m+1] \cup \cdots \cup y^{(k)}[m+1])$ surjectively onto the simplex of distributions
$\Delta(y^{(1)}[m+1, 1] \cup \cdots \cup y^{(k)}[m+1, k])$. The latter can be mapped by a further DBN
layer onto a larger simplex and so forth. Starting with $y^{(1)}[m+1]$, consider the sequence



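$$\begin{array}{l}
(0, 0, \ldots, 0, 0, y^{(1)}_{m+2}, \ldots, y^{(1)}_n)[m+1, 1] \\
(0, 0, \ldots, 0, 0, y^{(1)}_{m+2}, \ldots, y^{(1)}_n)[m+1, 2] \\
(1, 0, \ldots, 0, 0, y^{(1)}_{m+2}, \ldots, y^{(1)}_n)[m+1, 2] \\
\qquad \vdots
\end{array} \qquad (15)$$

continued as shown in Table 1. We denote this sequence by $G^1$, and its $l$-th row (a
cylinder set) by $G^1(l)$. The union $\bigcup_{l \in [K]} G^1(l)$ of the first $K$ rows, with
$K = 1 + q_1 + q_1 q_2 + \cdots + \prod_{j=1}^{m-1} q_j$, is equal to
$\mathcal{Y}_1 \times \cdots \times \mathcal{Y}_m \times \mathcal{Y}_{m+1} \times \{y^{(1)}_{m+2}\} \times \cdots \times \{y^{(1)}_n\}$.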

We define $k$ sequences $G^1, \ldots, G^k$ as follows: The first $m$ coordinates of $G^s$ are
equal to a permutation of the first $m$ coordinates of $G^1$, shifting each column cyclically $s$
positions to the right. The last $n - m$ coordinates of $G^s$ are equal to
$(y_{m+1}, y^{(s)}_{m+2}, \ldots, y^{(s)}_n)$.
We abbreviate $\{s + t\} := (s + t - 1) \bmod (m) + 1$. Within the first $m$ columns,
the free coordinate of the $l$-th row of $G^s$ is $\{s + \kappa\}$, where $\kappa$ is the least integer with
$l \leq \sum_{t=0}^{\kappa} \prod_{j=0}^{t-1} q_{\{s+j\}}$. Here the empty product is defined as 1. Let $q = \max_{j \in [m]} q_j$.
We can modify each sequence $G^s$, by repeating rows if necessary, such that the free
coordinate of the $l$-th row of the resulting sequence $\tilde{G}^s$ is $\{s + \kappa\}$, where $\kappa$ is the least
integer with $l \leq \sum_{t=0}^{\kappa} q^t$. This $\kappa$ does not depend on $s$.

The sequences $\tilde{G}^s$ for $s \in \{1, \ldots, k\}$ are all different from each other in the last
$n - m$ coordinates and have a different 'sharing' free coordinate in each row. The union
of cylinder sets in all rows of these sequences is equal to $\mathcal{Y}_1 \times \cdots \times \mathcal{Y}_n$.
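The structure of the first sequence $G^1$ can be sketched in Python for small state spaces. The code below is an illustration under stated assumptions: coordinates are 0-indexed, the helper names are hypothetical, the order of rows within one level is arbitrary, and only the unshifted sequence $G^1$ is built (not the cyclically shifted sequences $G^s$). It checks that the number of rows is $K = 1 + q_1 + q_1 q_2 + \cdots + \prod_{j=1}^{m-1} q_j$ and that the union of the rows covers all states whose last $n - m - 1$ coordinates are fixed.

```python
from itertools import product
from math import prod

def cylinder(base, free, sizes):
    """All states that agree with `base` outside the coordinates in `free`."""
    ranges = [range(sizes[i]) if i in free else (base[i],) for i in range(len(sizes))]
    return set(product(*ranges))

def sequence_g1(sizes, m, tail):
    """Rows (base, free coordinates) of G^1; `tail` fixes the last n - m - 1 entries."""
    rows = []
    for c in range(m):                                   # c is the 'sharing' free coordinate
        for prefix in product(*(range(sizes[j]) for j in range(c))):
            base = prefix + (0,) * (m - c) + (0,) + tuple(tail)
            rows.append((base, {c, m}))                  # coordinate m (the (m+1)-th) stays free
    return rows

sizes, m, tail = (3, 2, 2, 2), 2, (1,)
rows = sequence_g1(sizes, m, tail)
covered = set().union(*(cylinder(base, free, sizes) for base, free in rows))
K = sum(prod(sizes[:c]) for c in range(m))               # 1 + q_1 + ... + q_1 ... q_{m-1}
print(len(rows), K, len(covered))
```

In this example the four rows cover the $3 \cdot 2 \cdot 2 = 12$ states with last coordinate equal to 1, matching the statement about the union of the first $K$ rows.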
Table 1: Sequence of one-dimensional cylinder sets.

5 Deep Belief Networks



Proposition 10. Consider a DBN with $L \geq 2$ layers of width $n$, each layer containing
units with state spaces of cardinalities $q_1, \ldots, q_n$. Let $m$ be any integer with
$n \geq m \geq \prod_{j=m+2}^{n} q_j =: k$. This model can approximate a distribution $p$ arbitrarily
well whenever the support of $p$ is contained in $\bigcup_{s \in [k]} \bigcup_{l \in [L-2]} G^s(l)$.

Proof. Note that $\prod_{j=m+2}^{n} q_j \leq n \leq 1 + \sum_{j \in [n]} (q_j - 1)$. By Theorem 3, the top
RBM can approximate the probability simplex on
$\{0\} \times \cdots \times \{0\} \times \mathcal{X}_{m+1} \times \cdots \times \mathcal{X}_n$ arbitrarily well. By Theorem 9, this simplex can be
mapped iteratively into larger simplices, according to the sequences $G^s$ from Section 4. □

Theorem 11. Consider a DBN with $L$ layers of width $n$, each layer containing units with
state spaces of cardinalities $q_1, \ldots, q_n$. Let $m$ be any integer with $n \geq m \geq \prod_{j=m+2}^{n} q_j$
and $q := q_1 \geq \cdots \geq q_m$. If $L \geq 2 + 1 + q + \cdots + q^{k-1} = 2 + \frac{q^k - 1}{q - 1}$, then this model
can approximate a partition model $\mathcal{P}$ of coarseness $\prod_{j=1}^{m-k} q_j$ arbitrarily well.

Proof. We use the abbreviation $\{s + t\} := (s + t - 1) \bmod (m) + 1$. The top RBM
can approximate a partition model $\mathcal{P}$ (on a subset of $\mathcal{Y}$) arbitrarily well, whose partition
blocks are the cylinder sets with fixed coordinate values

$$y_s = 0,\ y_{\{s+1\}} = 0,\ \ldots,\ y_{\{s+r\}} = 0,\ y_{m+1},\ y^{(s)}_{m+2},\ \ldots,\ y^{(s)}_n,$$

for all $y_{m+1} \in \mathcal{Y}_{m+1}$, for all $s \in [k]$. When $L \geq 2 + 1 + q + q^2 + \cdots + q^r$ (with $r = k - 1$
in the notation of the theorem), after $L - 2$ probability sharing steps starting from $\mathcal{P}$, the DBN can
approximate the partition model arbitrarily well, whose partition blocks are the cylinder
sets with fixed coordinate values

$$y_s,\ y_{\{s+1\}},\ \ldots,\ y_{\{s+r\}},\ y_{m+1},\ y^{(s)}_{m+2},\ \ldots,\ y^{(s)}_n, \qquad (16)$$

for all possible choices of $y_s, y_{\{s+1\}}, \ldots, y_{\{s+r\}}, y_{m+1}$, for all $s \in [k]$. The maximal
cardinality of such a block is $q_1 \cdots q_{m-r-1}$, and the union of all blocks equals $\mathcal{Y}$. □

Proof of Theorem 2. The claim follows using Lemma 4 on the partition models described
in Theorem 11. □